Submodel Selection and Evaluation in Regression: The X-Random Case

Authors

Leo Breiman

Philip Spector

Abstract

Often, in a regression situation with many variables, a sequence of submodels containing fewer variables is generated using such methods as stepwise addition or deletion of variables, or "best subsets". The question is which submodel in this sequence is "best", and how submodel performance can be evaluated. This was explored in Breiman [1988] for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible, so this study involved an extensive simulation. The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying any prediction equation to the distributional universe of (y, x) values. This definition is used throughout to compare various submodels. There are startling differences between the x-fixed and x-random situations, and different PE estimates are appropriate. Non-resampling estimates such as Cp, adjusted R2, etc. turn out to be highly biased and almost worthless methods for submodel selection. The two best methods are cross-validation and bootstrap. One surprise is that 5-fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises.

* Work supported by NSF Grant No. DMS-8718362.
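The cross-validation comparison described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch, not the authors' simulation design: the sample size, coefficients, Gaussian random X, the absence of an intercept, and the helper names `forward_stepwise` and `cv_prediction_error` are all assumptions made for illustration. It builds a sequence of submodels by forward stepwise addition inside each training fold and estimates PE for each submodel size from the held-out points.

```python
import numpy as np

def forward_stepwise(X, y, max_size):
    """Greedy forward addition: return the column lists for sizes 1..max_size."""
    chosen, path = [], []
    for _ in range(max_size):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen = chosen + [best_j]
        path.append(list(chosen))
    return path

def cv_prediction_error(X, y, n_folds, max_size, rng):
    """V-fold CV estimate of PE for each submodel size (selection redone per fold)."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    sq_err = np.zeros(max_size)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        X_tr, y_tr = X[train], y[train]
        for k, cols in enumerate(forward_stepwise(X_tr, y_tr, max_size)):
            beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
            resid = y[test_idx] - X[test_idx][:, cols] @ beta
            sq_err[k] += np.sum(resid ** 2)
    return sq_err / n  # average squared error over all held-out points

# Assumed toy setup: random X, three real effects, five noise variables.
rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

pe_5fold = cv_prediction_error(X, y, 5, p, rng)   # leave out 20% at a time
pe_loo = cv_prediction_error(X, y, n, p, rng)     # leave-one-out
print("5-fold PE by submodel size:", np.round(pe_5fold, 2))
print("LOO PE by submodel size:   ", np.round(pe_loo, 2))
print("sizes chosen:", pe_5fold.argmin() + 1, pe_loo.argmin() + 1)
```

A single simulated data set like this one cannot reproduce the paper's finding about 5-fold versus leave-one-out; comparing the two would require repeating the experiment over many data sets and measuring the true PE of the selected submodels.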


Similar resources

Regression quantiles and trimmed least squares estimator under a general design

where X = (X_1, ..., X_n)' is the vector of independent observations, C is the n x p design matrix, β = (β_1, ..., β_p)' is the vector of unknown parameters and E = (E_1, ..., E_n)', where E_1, ..., E_n are independent and identically distributed (i.i.d.) random variables with a continuous distribution function (d.f.) F. Our main interest is in robustly estimating the parameter β. For the location ...

Identification of Genetic Polymorphism Interactions in Sporadic Alzheimer’s Disease Using Logic Regression

Objectives: Genetic polymorphism interactions are among the important factors in affliction with complex diseases like Alzheimer’s disease. The important goal of genetic association studies is to identify a combination of polymorphisms and measure their importance in increasing the risk of occurrence of such diseases. In this study, feature selection approach of logic regression was used to ide...

A Fuzzy Approach for Projects Evaluation and Selection an Iranian Auto Manufacturer Case Study

Evaluating and selecting alternative investment projects requires considering all relevant and important aspects. In traditional methods, the focus is just on tangible monetary criteria. Also, in the traditional methods, either all the information about the factors must be known precisely or sufficient objective data must be available for applying probability theory. In this paper, a combinative app...

A Universal Selection Method in Linear Regression Models

In this paper we consider a linear regression model with fixed design. A new rule for the selection of a relevant submodel is introduced on the basis of parameter tests. One particular feature of the rule is that subjective grading of the model complexity can be incorporated. We provide bounds for the mis-selection error. Simulations show that by using the proposed selection rule, the mis-selec...

The Usefulness of Ensemble Regressions and Optimal Predictor Variable Selection Methods in Predicting Stock Returns

This paper examines the usefulness of ensemble regressions and methods for selecting optimal predictor variables (including the correlation-based method and Relief) for predicting the stock returns of companies listed on the Tehran Stock Exchange. To evaluate the performance of ensemble regression, the evaluation criteria (including mean absolute percentage error, root mean squared error, and coefficient of determination) for this method's predictions are compared with linear regression and artificial neural networks...

Publication date: 2008
